[V0] Correct CUDA Graph capture for encoder-decoder models #22630
Conversation
This commit addresses a bug where CUDA Graph failed to provide performance benefits for Whisper-style encoder-decoder models. The previous implementation capped `max_seq_len_to_capture` at the model's maximum sequence length, which for Whisper reflects only the decoder (448 tokens), while the encoder's max sequence length (1500) is significantly larger. This mismatch left the captured graph too small for the encoder's operations, so the feature was never properly leveraged. The fix updates the capture logic to determine the maximum sequence length by considering both the encoder and the decoder. With the captured graph large enough to cover both components, CUDA Graph can now be used for these models, yielding improved inference performance.
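For context, the two limits cited above correspond to standard fields of Whisper's Hugging Face config; a quick way to confirm them (an illustration only, not part of this PR) is:

```python
# Illustrative check of Whisper's encoder/decoder limits via transformers.
# Assumes network access to huggingface.co to fetch the config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
print(cfg.max_source_positions)  # 1500 -> encoder (audio) positions
print(cfg.max_target_positions)  # 448  -> decoder (text) positions, i.e. vLLM's max_model_len
```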
Code Review
This pull request fixes a bug with CUDA Graph capture for encoder-decoder models. The change prevents `max_seq_len_to_capture` from being incorrectly capped by a smaller model length, which is correct. I've suggested a more robust implementation that explicitly considers both encoder and decoder maximum lengths, making the fix more resilient and less dependent on default values.
vllm/config/__init__.py
```python
if not self.is_encoder_decoder:
    self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
                                      self.max_model_len)
```
While this change correctly addresses the issue for encoder-decoder models by not capping `max_seq_len_to_capture`, it relies on the default value being sufficiently large. This could be fragile if a user sets a smaller `max_seq_len_to_capture` or for models with very large encoder sequence lengths.
A more robust approach would be to explicitly determine the maximum length by considering both encoder and decoder sequence lengths, and then use that to cap `max_seq_len_to_capture`. This aligns better with the PR's goal to 'correctly determine the max sequence length'.
Consider this alternative implementation:
```diff
-if not self.is_encoder_decoder:
-    self.max_seq_len_to_capture = min(self.max_seq_len_to_capture,
-                                      self.max_model_len)
+max_len = self.max_model_len
+if self.is_encoder_decoder:
+    max_len = max(
+        max_len, getattr(self.hf_config, "max_source_positions", 0))
+self.max_seq_len_to_capture = min(self.max_seq_len_to_capture, max_len)
```
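To make the suggested logic concrete, here is a small self-contained sketch with Whisper-style numbers plugged in (the 8192 default for `max_seq_len_to_capture` is an assumption about vLLM's default; the other values come from the discussion above):

```python
# Plain-Python walk-through of the suggested capping logic for a Whisper-style
# model; values are hard-coded for illustration, not read from a real config.
max_model_len = 448             # decoder limit (Whisper's max_target_positions)
max_source_positions = 1500     # encoder limit (Whisper's max_source_positions)
max_seq_len_to_capture = 8192   # assumed default capture size

is_encoder_decoder = True
max_len = max_model_len
if is_encoder_decoder:
    # Raise the cap so the captured graph also covers the encoder.
    max_len = max(max_len, max_source_positions)   # -> 1500
max_seq_len_to_capture = min(max_seq_len_to_capture, max_len)
print(max_seq_len_to_capture)   # 1500: large enough for both encoder and decoder
```

The `getattr(..., "max_source_positions", 0)` in the suggestion keeps this safe for encoder-decoder configs that lack the attribute, in which case the cap simply falls back to `max_model_len`.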
Have you considered the impact of this change on mllama? It would be great if you could run benchmarks on both Whisper and mllama.
Thanks for the suggestion. I have verified the impact on the Whisper-large-v3-turbo model using NVIDIA H20e GPUs. The change reduced the average inference time from approximately 320ms to 190ms for the same audio, which is a significant performance gain. This updated Whisper service has been running successfully in our production environment under high load for several weeks. Regarding mllama, I haven't used it before, but I will start some basic performance and load tests to evaluate the impact.
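For anyone wanting to reproduce a rough version of this measurement, a minimal sketch in the spirit of vLLM's offline Whisper example might look like the following (the model name, prompt format, `AudioAsset` sample, and decoding settings are assumptions, not the exact benchmark script used above):

```python
# Rough latency check for Whisper transcription with vLLM's offline API.
# This is a sketch, not the benchmark harness used for the numbers above.
import time

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(model="openai/whisper-large-v3-turbo", max_model_len=448)

prompt = {
    "prompt": "<|startoftranscript|>",
    "multi_modal_data": {"audio": AudioAsset("mary_had_lamb").audio_and_sample_rate},
}
params = SamplingParams(temperature=0, max_tokens=200)

llm.generate([prompt], params)          # warm-up (includes CUDA Graph capture)
start = time.perf_counter()
for _ in range(10):
    llm.generate([prompt], params)
print(f"avg latency: {(time.perf_counter() - start) / 10 * 1000:.1f} ms")
```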
re: @DarkLight1337 Thanks again for the suggestion and for your patience. I have completed the benchmarks for mllama and can now share the full findings. For the meta-llama/Llama-3.2-11B-Vision-Instruct model, tested on NVIDIA H20e GPUs, the performance remained stable. This stability is expected. As a reminder, this fix provided a significant performance boost for Whisper-large-v3-turbo, reducing average latency from ~320ms to ~190ms. This change is stable and has been running in our production environment for weeks. I believe this change is valuable for models like Whisper, where the encoder-decoder sequence length mismatch prevents CUDA Graph from being fully utilized.
Ok nice, LGTM then, thanks for optimizing this!
Thanks a lot for fixing this, this is great!
I feel this feature could use a unit test to make sure the V1 port #21088 correctly works with it though.
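A possible shape for such a test, sketched under the assumption that `EngineArgs.create_engine_config()` exposes the resulting `ModelConfig` and that the Whisper config can be downloaded, would be to assert that the capture size is never smaller than the encoder length:

```python
# Sketch of a regression test: for encoder-decoder models, the captured graph
# size should be at least the encoder sequence length so that encoder and
# cross-attention shapes fit inside the captured graph.
from vllm.engine.arg_utils import EngineArgs


def test_capture_size_covers_whisper_encoder():
    config = EngineArgs(model="openai/whisper-large-v3-turbo").create_engine_config()
    model_config = config.model_config

    encoder_len = getattr(model_config.hf_config, "max_source_positions", 0)
    assert encoder_len == 1500
    assert model_config.max_seq_len_to_capture >= encoder_len
```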
PTAL at the failing Worker test
Head branch was pushed to by a user without write access
re: @DarkLight1337
…ect#22630) Signed-off-by: Paul Pak <[email protected]>